Empirical Methods in Information Extraction

Author

  • Claire Cardie
Abstract

Most corpus-based methods in natural language processing (NLP) were developed to provide an arbitrary text-understanding application with one or more general-purpose linguistic capabilities. This is evident from the articles in this issue of AI Magazine. Charniak and Ng/Zelle, for example, describe techniques for part-of-speech tagging, parsing, and word-sense disambiguation. These techniques were created with no specific domain or high-level language-processing task in mind. In contrast, this article surveys the use of empirical methods for a particular natural language understanding task that is inherently domain-specific. The task is information extraction. Very generally, an information extraction system takes as input an unrestricted text and "summarizes" the text with respect to a prespecified topic or domain of interest: it finds useful information about the domain and encodes that information in a structured form, suitable for populating databases. In contrast to in-depth natural language understanding tasks, information extraction systems effectively skim a text to find relevant sections and then focus only on these sections in subsequent processing. The information extraction system in Figure 1, for example, summarizes stories about natural disasters, extracting for each such event the type of disaster, the date and time that it occurred, and data on any property damage or human injury caused by the event.

Information extraction has figured prominently in the field of empirical NLP: the first large-scale, head-to-head evaluations of NLP systems on the same text-understanding tasks were the DARPA-sponsored MUC performance evaluations of information extraction systems (Lehnert and Sundheim, 1991; Chinchor et al., 1993). Prior to each evaluation, all participating sites receive a corpus of texts from a predefined domain and the corresponding "answer keys" to use for system development. The answer keys are manually encoded templates, much like that of Figure 1, that capture all information from the corresponding source text that is relevant to the domain, as specified in a set of written guidelines. After a short development phase, the NLP systems are evaluated by comparing the summaries each produces with the summaries generated by human experts for the same test set of previously unseen texts. The comparison is performed using an automated scoring program that rates each system according to measures of recall and precision. Recall measures the amount of the relevant information that the NLP system correctly extracts from the test collection, while precision measures the reliability of the information extracted:
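
In the usual MUC slot-filler terms (a standard formulation assumed here rather than quoted from the article), the two scores can be written as:

    recall    = number of correct slot fillers extracted / number of slot fillers in the answer key
    precision = number of correct slot fillers extracted / total number of slot fillers extracted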

Similar Resources

A review on EEG based brain computer interface systems feature extraction methods

The brain-computer interface (BCI) provides a communication channel between humans and machines. Most of these systems are based on brain activities. Brain-computer interfacing is a methodology that provides a way to communicate with the outside environment using the brain's thoughts. The success of this methodology depends on the selection of methods to process the brain signals in each pha...

Detection of perturbed quantization (PQ) steganography based on empirical matrix

The perturbed quantization (PQ) steganography scheme is almost undetectable with current steganalysis methods. We present a new steganalysis method for detecting this data-hiding algorithm. We show that the PQ method distorts the dependencies of DCT coefficient values, especially changes much lower than the significant bit planes. For steganalysis of PQ, we propose feature extraction from the e...

Presenting an Empirical Correlation for Maximum Sauter Mean Diameter in a Spray Extraction Column

Based on the importance of drop behavior in liquid-liquid extraction, the maximum Sauter mean drop diameter has been investigated and correlated in a counter-current spray extraction column with two chemical systems. The spargers were sets of nozzles in all experiments. Studying the effects of several parameters on drop size, some correlations were estimated with the latest available version of softw...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

An overview of empirical natural language processing

In recent years, there has been a resurgence in research on empirical methods in natural language processing. These methods employ learning techniques to automatically extract linguistic knowledge from natural language corpora rather than require the system developer to manually encode the requisite knowledge. The current special issue reviews recent research in empirical methods in speech reco...


Journal:
  • AI Magazine

Volume 18, Issue

Pages -

Publication date: 1997